In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 10
In [2]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data).set_index('country')
countries
Out[2]:
Let's define a table with natural increase rates in 2013 (data from World Bank):
In [3]:
death_rate = pd.Series([10, 9, 11, 8, 9],
index=['Poland','United Kingdom', 'Germany', 'Netherlands', 'France'])
print(death_rate)
In [4]:
birth_rate = pd.Series([10, 9, 10, 12],
index=['Netherlands', 'Germany', 'Poland', 'France'])
print(birth_rate)
Now we calculate the natural increae by subtracting death rate from birth rate:
In [5]:
natural_increase = birth_rate - death_rate
print(natural_increase)
Note that the rows where the two series did not overlap contain missing values (NaN
= Not a Number) and that the data were properly aligned on the index.
In [6]:
pop_change = pd.DataFrame({'death rate' : death_rate,
'birth rate' : birth_rate,
'natural increase' : natural_increase})
We can remove the missing data using dropna
method:
In [7]:
pop_change.dropna(inplace=True)
In [8]:
pop_change
Out[8]:
In [ ]:
In [ ]:
Let's now try to add the data to the country data:
In [11]:
countries.join(pop_change)
Out[11]:
There are four different ways we can handle the missing rows:
Note that the methods are similar to SQL JOIN clause.
In [12]:
countries.join(pop_change, how='right')
Out[12]:
In [ ]:
By "group by" we are referring to a process involving one or more of the following steps
Similar to SQL GROUP BY
The example of the image in pandas syntax:
In [14]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Out[14]:
In [15]:
df.groupby('key').aggregate('sum') # np.sum
Out[15]:
In [16]:
df.groupby('key').sum()
Out[16]:
You can also simply count members of each split:
In [17]:
df.groupby('key').size()
Out[17]:
These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv
and cast.csv
and put them in the /data
folder.
In [18]:
cast = pd.read_csv('data/cast.csv')
cast[10:15]
Out[18]:
In [19]:
titles = pd.read_csv('data/titles.csv')
titles.head()
Out[19]:
In [ ]:
In [ ]:
You can also group by the values on another array under the condition that this array has the length equal to the number of rows:
In [22]:
greek = ['α', 'β', 'β', 'β', 'β', 'α', 'β','α', 'α']
df.groupby(greek).max()
Out[22]:
The values for the grouping array can be also computed from values in the data frame. For example, to count odd and even number in the data column we could simply:
In [23]:
df.groupby(df['data'] % 2).size()
Out[23]:
In [ ]:
In [ ]:
Note that you can also groupby on multiple keys:
In [26]:
df['type'] = np.where(df['data'] % 2, 'odd', 'even')
print(df)
In [27]:
df.groupby(['type', 'key']).sum()
Out[27]:
In [28]:
df['type'] = np.where(df['data'] % 2, 'odd', 'even')
print(df)
df['data']
Out[28]:
Note that it creates a hierarchical index. More on that later.
In [ ]:
In [ ]:
In [ ]:
A useful shortcut to calculate the number of occurences of certain values is value_counts
(this is somewhat equivalent to df.groupby(key).size())
)
For example, what are the most occuring movie titles?
In [32]:
titles.title.value_counts().head()
Out[32]:
In [ ]:
Aggregate function could be any function accepting the Series
object.
For example, let's calculate most frequent apperances in each year of last decade:
In [34]:
def most_frequent(x):
return x.value_counts().index[0]
In [35]:
cast.loc[(cast['year'] >= 2010) & (cast['year'] < 2020), ['year', 'name']].groupby('year').agg(most_frequent)
Out[35]:
© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).
© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)
Licensed under CC BY 4.0 Creative Commons
In [ ]: